Traffic congestion and the associated pollution are a blight on many of our big cities. One potential solution is to encourage individuals to use alternative means of transportation. In August 2013, the Bay Area Air Quality Management District launched the Bay Area Bike Share pilot program in five cities in California.
The aim of this project is to use the data from the pilot scheme to determine whether the number of bikes used on any given day can be reasonably predicted.
This first section of code sets up knitr and sets the root directory.
This section of code loads the necessary libraries.
# Clear environment
rm(list=ls())
# Get necessary libraries
library(modelr)
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.5.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(tictoc)
library(ggplot2)
library(leaflet)
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ggmap)
## Google's Terms of Service: https://cloud.google.com/maps-platform/terms/.
## Please cite ggmap if you use it! See citation("ggmap") for details.
library(tictoc)
library(ggrepel)
library(htmlwidgets)
This section of code downloads the information for stations, trips taken and the weather in the Bay Area for the period of the pilot study.
#download datafiles
#the original source requires setting up a Kaggle account, and
#downloading from it requires a Kaggle API token on your machine.
#I have stored the files on a Google Drive, accessible via the IDs below.
#we use three different files - stations, weather and trips.
#depending on the speed of your connection, this should take c. 10 seconds
if(file.exists("station.csv"))
{stations<-read_csv(file="station.csv")} else
{idstations <- "1CTHRmjX7NVUfA3Ll705PI4U39MjxaZQP"
stations<-read.csv(
sprintf("https://docs.google.com/uc?id=%s&export=download",
idstations))
write.csv(stations, "station.csv", row.names=FALSE)}
## Parsed with column specification:
## cols(
## id = col_double(),
## name = col_character(),
## lat = col_double(),
## long = col_double(),
## dock_count = col_double(),
## city = col_character(),
## installation_date = col_character()
## )
if(file.exists("weather.csv"))
{weather<-read_csv(file="weather.csv")} else
{idweather<- "1ZTmAshBSHzjp_ltyjcDWTauSZmKrVyKG"
weather <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
idweather))
write.csv(weather, "weather.csv", row.names=FALSE)}
## Parsed with column specification:
## cols(
## .default = col_double(),
## date = col_character(),
## precipitation_inches = col_character(),
## events = col_character()
## )
## See spec(...) for full column specifications.
if(file.exists("trip.csv"))
{trips<- read_csv(file="trip.csv")} else
{idtrips<- "17ANzRpWAlbLJ-bAPYaUnCmdfyYWDcRQ6"
trips <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download",
idtrips))
write.csv(trips, "trip.csv", row.names=FALSE)}
## Parsed with column specification:
## cols(
## id = col_double(),
## duration = col_double(),
## start_date = col_character(),
## start_station_name = col_character(),
## start_station_id = col_double(),
## end_date = col_character(),
## end_station_name = col_character(),
## end_station_id = col_double(),
## bike_id = col_double(),
## subscription_type = col_character(),
## zip_code = col_double()
## )
## Warning: 11058 parsing failures.
## row col expected actual file
## 11506 zip_code no trailing characters -2585 'trip.csv'
## 15287 zip_code no trailing characters -2585 'trip.csv'
## 15409 zip_code no trailing characters -2585 'trip.csv'
## 21731 zip_code no trailing characters -2585 'trip.csv'
## 32654 zip_code no trailing characters -2585 'trip.csv'
## ..... ........ ...................... ...... ..........
## See problems(...) for more details.
#if you have a kaggle account then you can download the files at
#https://www.kaggle.com/benhamner/sf-bay-area-bike-share
#please download "station.csv", "weather.csv", and "trip.csv" and put them
#in your root directory. Then load the csv files as follows:
#load the stations data file
#stations<-read_csv(file="station.csv")
#load the weather data file
#weather<-read_csv(file="weather.csv")
#load the trips from data file
#trips<- read_csv(file="trip.csv")
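The cache-or-download pattern above is repeated for each of the three files; it can be captured in a small helper function. The sketch below is illustrative only (the `fetch_csv` name is ours, not part of the original script):

```r
# A hypothetical helper wrapping the cache-or-download pattern used above:
# read a local copy if one exists, otherwise download from Google Drive by
# file ID and cache the result locally for next time.
fetch_csv <- function(local_path, file_id) {
  if (file.exists(local_path)) {
    return(read.csv(local_path, stringsAsFactors = FALSE))
  }
  url <- sprintf("https://docs.google.com/uc?id=%s&export=download", file_id)
  dat <- read.csv(url, stringsAsFactors = FALSE)
  write.csv(dat, local_path, row.names = FALSE)
  dat
}

# e.g. stations <- fetch_csv("station.csv", "1CTHRmjX7NVUfA3Ll705PI4U39MjxaZQP")
```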
Now that we have the data, it is helpful to see where the bicycle rental stations are located. The following code prepares a map showing the bike rental stations, which are located in five Bay Area cities: San Francisco, Mountain View, Redwood City, Palo Alto and San Jose.
library(maps)
##
## Attaching package: 'maps'
## The following object is masked from 'package:purrr':
##
## map
# map the stations
bounds <- map('state', c('California'), fill=TRUE, plot=FALSE)
# A custom icon.
icons <- awesomeIcons(
icon = 'disc',
iconColor = 'black',
library = 'ion', # Options are 'glyphicon', 'fa', 'ion'.
markerColor = 'blue',
squareMarker = TRUE
)
map <- leaflet(data = stations) %>%
addProviderTiles("CartoDB.Positron", group = "Map") %>%
addProviderTiles("Esri.WorldImagery", group = "Satellite") %>%
addProviderTiles("Esri.WorldShadedRelief", group = "Relief") %>%
addMarkers(~long, ~lat, label = ~name, group = "Stations") %>%
addPolygons(data=bounds, group="States", weight=2, fillOpacity = 0) %>%
addScaleBar(position = "bottomleft") %>%
addLayersControl(
baseGroups = c("Map", "Satellite", "Relief"),
overlayGroups = c("Stations", "States"),
options = layersControlOptions(collapsed = FALSE))
#invisible(print(map))
map
detach("package:maps", unload=TRUE) #detach the maps package, as its map()
#masks purrr::map() and breaks the cross validation code below
Our initial hypothesis is that weather might have an impact on ridership. The weather file needs to be tidied up a little before it can be used. Then the weather data for each city needs to be separated so that it can be matched to individual city trip information.
#sort out the date formatting so that files can be cross referenced
weather$date=as.Date(weather$date,format="%m/%d/%Y")
#fix the formatting issue with the precipitation column so that it is
#recognized as numeric.
weather$precipitation_inches <- as.numeric(weather$precipitation_inches)
## Warning: NAs introduced by coercion
weather$mean_temperature_f <-as.numeric(weather$mean_temperature_f)
#select the relevant columns from the weather file.
weather<-weather[,c(1,3,18,20,21,24)]
#filter by zipcode for each of the cities in the Bay Area
wsf <- filter(weather, zip_code == "94107")
wsj <- filter(weather, zip_code == "95113")
wpa <- filter(weather, zip_code == "94301")
wrc <- filter(weather, zip_code == "94063")
wmv <- filter(weather, zip_code == "94041")
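The "NAs introduced by coercion" warning above comes from non-numeric entries in the precipitation column: `as.numeric()` turns anything it cannot parse into NA. A minimal base-R illustration:

```r
# as.numeric() coerces anything it cannot parse to NA (with a warning),
# which is why the non-numeric "T" precipitation entries become NA above
x <- c("0", "0.01", "T")
y <- suppressWarnings(as.numeric(x))
y          # 0.00 0.01   NA
is.na(y)   # FALSE FALSE  TRUE
```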
The trip data also needs some attention. The trip data does not have the city name in the file, so this needs to be added from the station file. The data formatting also needs work and then the trips need to be separated into the individual cities. Finally, the trip data needs to be aggregated by day.
#identify city by matching start station id to station id file, looking up city
#and adding "startcityname" to the trips file.
trips$startcityname=stations$city[match(trips$start_station_id, stations$id)]
#sort out the date formatting so that files can be cross referenced by date
trips$date=as.Date(trips$start_date, format = "%m/%d/%Y")
#filter by city
tripssf<- filter(trips, startcityname=="San Francisco")
tripssj<- filter(trips, startcityname=="San Jose")
tripspa<- filter(trips, startcityname=="Palo Alto")
tripsrc<- filter(trips, startcityname=="Redwood City")
tripsmv<- filter(trips, startcityname=="Mountain View")
#aggregate the trip data by day, by city
#San Jose
totalsj<-tripssj%>%
group_by(date, startcityname)%>%
summarise(n=n())
## `summarise()` regrouping output by 'date' (override with `.groups` argument)
#San Francisco
totalsf<-tripssf%>%
group_by(date, startcityname)%>%
summarise(n=n())
## `summarise()` regrouping output by 'date' (override with `.groups` argument)
#Palo Alto
totalpa<-tripspa%>%
group_by(date, startcityname)%>%
summarise(n=n())
## `summarise()` regrouping output by 'date' (override with `.groups` argument)
#Mountain View
totalmv<-tripsmv%>%
group_by(date, startcityname)%>%
summarise(n=n())
## `summarise()` regrouping output by 'date' (override with `.groups` argument)
#Redwood City
totalrc<-tripsrc%>%
group_by(date,startcityname)%>%
summarise(n=n())
## `summarise()` regrouping output by 'date' (override with `.groups` argument)
Now that the trip and weather datasets are in good shape, we can join them to create the dataset we need, namely trips and weather by date. Some precipitation values are recorded as "T" (a trace amount too small to measure); we remove these entries, and also remove the small number of NAs in the dataset. Finally we combine the individual city data into one file for the whole of the Bay Area.
#join the datasets
sj_tidy<-left_join(totalsj,wsj, by=c("date"))
sf_tidy<-left_join(totalsf,wsf, by=c("date"))
pa_tidy<-left_join(totalpa,wpa, by=c("date"))
mv_tidy<-left_join(totalmv,wmv, by=c("date"))
rc_tidy<-left_join(totalrc,wrc, by=c("date"))
#remove the NA's
sf_tidy <- na.omit(sf_tidy)
#some precipitation values are recorded as "T" (trace amounts)
# remove all entries for these days.
sf_rain=sf_tidy[!sf_tidy$precipitation_inches=="T",]
sj_rain=sj_tidy[!sj_tidy$precipitation_inches=="T",]
pa_rain=pa_tidy[!pa_tidy$precipitation_inches=="T",]
mv_rain=mv_tidy[!mv_tidy$precipitation_inches=="T",]
rc_rain=rc_tidy[!rc_tidy$precipitation_inches=="T",]
#combine all the city files to make one Bay Area file.
rain<-rbind(sf_rain, sj_rain, pa_rain, mv_rain, rc_rain)
#remove all NA entries
rain<-na.omit(rain)
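The join step above can be sketched with two toy tables: `left_join` keeps every row of the daily trip counts and attaches the matching weather row by date (base R's `merge(..., all.x = TRUE)` behaves the same way). The values below are invented purely for illustration:

```r
# toy daily trip counts and weather, joined by date (like left_join)
trips_day <- data.frame(date = as.Date(c("2013-09-01", "2013-09-02", "2013-09-03")),
                        n    = c(5, 8, 6))
wx <- data.frame(date = as.Date(c("2013-09-01", "2013-09-02")),
                 mean_temperature_f = c(70, 65))
joined <- merge(trips_day, wx, by = "date", all.x = TRUE)
joined$mean_temperature_f   # 70 65 NA -> the NA row is what na.omit() removes
```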
Now that we have the dataset we need, we start with a big picture view of ridership in the whole of the Bay Area. A scatter plot is chosen to represent this data.
#start with a big picture view of daily ridership in the Bay Area
g1<-ggplot(rain, aes(date, n)) + geom_point() +
ylab("Number of Trips")+xlab("Date") +
ggtitle("Number of Daily Bike Trips by Date in Bay Area")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
plot(g1)
This chart suggests that there are at least three major groups of data, represented in three broad bands. Further analysis is required to understand the composition of these bands.
The following chart repeats the scatter plot but this time displays the cities of the Bay Area in different colors.
#repeat the chart, broken out by city
g2<-ggplot(data=rain,
aes(x=date,y=n,
color=as.factor(startcityname)))
g2<-g2+geom_point(size=1.)
g2<-g2+theme(legend.position="bottom" )
g2<-g2+ylab("Number of Trips")+xlab("Date") +
ggtitle("Number of Daily Bike Trips by Date in Bay Area")
g2 <- g2+ scale_color_discrete(name="City")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
g2
This chart demonstrates that one reason for the three different bands of data is that the daily ridership in San Francisco is significantly higher than in the other cities of the Bay Area. Yet, even within San Francisco, the ridership falls into two distinct bands. Our hypothesis is that this may be due to different patterns of ridership on different days.
The following code adds the day of the week to the datasets and whether the day is a weekday or a weekend. Finally, the code also itemizes whether the day is a national holiday or not.
#add days of week to data sets
rain$day<- weekdays(as.Date(rain$date)) #whole dataset
sj_tidy$day <- weekdays(as.Date(sj_tidy$date)) # San Jose
sf_tidy$day <- weekdays(as.Date(sf_tidy$date)) #San Fran
mv_tidy$day <- weekdays(as.Date(mv_tidy$date)) #Mountain View
rc_tidy$day <- weekdays(as.Date(rc_tidy$date)) #Redwood City
pa_tidy$day <- weekdays(as.Date(pa_tidy$date)) #Palo Alto
#add whether a day of the week or weekend
weekdays1 <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday')
sj_tidy$wDay <- c('weekend', 'weekday')[(weekdays(sj_tidy$date) %in% weekdays1)+
1L]
sf_tidy$wDay <- c('weekend', 'weekday')[(weekdays(sf_tidy$date) %in% weekdays1)+
1L]
pa_tidy$wDay <- c('weekend', 'weekday')[(weekdays(pa_tidy$date) %in% weekdays1)+
1L]
mv_tidy$wDay <- c('weekend', 'weekday')[(weekdays(mv_tidy$date) %in% weekdays1)+
1L]
rc_tidy$wDay <- c('weekend', 'weekday')[(weekdays(rc_tidy$date) %in% weekdays1)+
1L]
#define whether a national holiday or not
holidays<- as.Date(c("2013-09-02","2013-11-11","2013-11-25",
"2013-12-25","2014-01-01","2014-01-20",
"2014-05-26","2014-07-04","2014-09-01",
"2014-11-11","2014-11-27","2014-12-25",
"2015-01-01", "2015-01-19",
"2015-05-25", "2015-07-03"))
#classify holidays in the datafiles
sj_tidy$holiday<- sj_tidy$date %in% holidays
sf_tidy$holiday<- sf_tidy$date %in% holidays
pa_tidy$holiday<- pa_tidy$date %in% holidays
mv_tidy$holiday<- mv_tidy$date %in% holidays
rc_tidy$holiday<- rc_tidy$date %in% holidays
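The weekday/weekend classification above relies on an indexing trick: the logical test is coerced to 0/1, shifted to 1/2 by adding `1L`, and used to index a two-element vector. A minimal, locale-independent illustration:

```r
# TRUE + 1L is 2 and FALSE + 1L is 1, so the logical vector picks
# 'weekday' for TRUE and 'weekend' for FALSE
is_weekday <- c(TRUE, TRUE, FALSE)        # e.g. Thursday, Friday, Saturday
c('weekend', 'weekday')[is_weekday + 1L]  # "weekday" "weekday" "weekend"
```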
Now that the data set is complete, the scatter plot analysis is re-run for San Francisco only, testing whether the type of day (weekday, colored pink, vs weekend, colored aquamarine) has any impact. In addition, a regression line is included for each group.
#chart bike usage differentiated by type of day (weekday/weekend)
sf_select <- sf_tidy[sf_tidy$wDay %in% c("weekend", "weekday"), ]
theme_set(theme_bw()) # pre-set the bw theme.
g <- ggplot(sf_select, aes(date, n)) +
labs(subtitle="Daily trips",
title="San Francisco Bike Share Usage")
g + geom_jitter(aes(col=wDay)) +
geom_smooth(aes(col=wDay), method="lm", se=F) +
ylab("No. of daily trips")+xlab("Date")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
## `geom_smooth()` using formula 'y ~ x'
#add labels
This chart clearly shows that the type of day makes a difference to the level of daily ridership. Further, the lower-than-typical weekday ridership around certain times of the year (e.g. January 2015) suggests that national holidays also make a difference. The chart also shows that weekday ridership increases over the course of the pilot study, whereas weekend ridership decreases.
The next chart examines each individual day of the week to see if there are any discernible patterns. Note, the chart uses the “jitter” function to enable greater visibility of the data points which would otherwise be on top of each other.
#examine day of week pattern by day of week
gg1 <-ggplot(sf_tidy, aes(x=day, y=n))
gg1 <- gg1 + geom_boxplot(fill="blue", color="dark blue", alpha =0.3) +
geom_jitter(color = "red", alpha=0.2)+ #add jitter to show individual entries
ylab("No. of daily trips")+xlab("Day of the Week") +
ggtitle("San Francisco")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))#add labels
gg1
This box plot shows that Saturday and Sunday have significantly lower ridership, with much less variation around the median.
The next part of the analysis considers whether weather has an impact on daily ridership. We start with temperature, splitting the data into the type of day.
#Chart ridership by temperature
g2<-ggplot(data=sf_tidy,
aes(x=mean_temperature_f,y=n,
color=as.factor(wDay)))
g2<-g2+geom_point(size=1.5)
g2<-g2+theme(legend.position="right" ) +
geom_smooth(aes(col=wDay), method="lm", se=F)
g2<-g2+ylab("Number of Trips")+xlab("Mean Temperature in F") +
ggtitle("Number of Bike Trips and Temperature in San Francisco")
g2 <- g2+ scale_color_discrete(name="Day Type") +
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
g2
## `geom_smooth()` using formula 'y ~ x'
The chart suggests that daily ridership increases as a function of temperature. For comparison purposes, the other cities in the Bay Area are reviewed on the same basis - i.e. does temperature appear to affect daily ridership?
#show the other cities in the Bay Area, excl San Francisco
#temperature
rainexsf <- filter(rain, startcityname != "San Francisco")
g5<-ggplot(filter(rainexsf, is.na(startcityname) == FALSE),
aes(x=mean_temperature_f,y=n, color=as.factor(startcityname)))
g5<-g5+geom_point(alpha=1,size=2)
g5<-g5+geom_smooth(method="lm",color="black")
g5<-g5+facet_wrap(~as.factor(startcityname),nrow=2)
g5<-g5+xlab("Temperature in Degrees F")+ylab("Number of Daily rides")
g5<-g5+theme(legend.position="none",panel.grid.minor=element_blank())
g5
## `geom_smooth()` using formula 'y ~ x'
These charts would appear to indicate that the effect of increased ridership with increased temperature can be seen in each of the other cities.
Next we consider precipitation. Starting with San Francisco:
#San Francisco
g1sf<-ggplot(sf_rain,aes(x=precipitation_inches,y=n)) #specify data and x and y
g1sf<-g1sf+geom_point()+ geom_smooth(method=lm)
g1sf<-g1sf+ylab("No. of daily trips")+xlab("Precipitation in inches")+
ggtitle("Number of Bike Trips and Precipitation in San Francisco") +
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
g1sf #display result
## `geom_smooth()` using formula 'y ~ x'
This chart suggests that a negative relationship exists between daily precipitation and ridership. However, caution is required in interpretation.
There appear to be relatively few days with precipitation in San Francisco, and what precipitation does occur is typically less than one inch per day.
There is one outlier day with more than 3 inches of rain.
For comparison purposes, the analysis is re-run for each of the other cities in the Bay Area.
#show the other cities in the Bay Area, excl San Francisco
#precipitation
rainexsf<- filter(rain, startcityname != "San Francisco")
g5<-ggplot(filter(rainexsf, is.na(startcityname) == FALSE),
aes(x=precipitation_inches,y=n, color=as.factor(startcityname)))
g5<-g5+geom_point(alpha=1,size=2)
g5<-g5+geom_smooth(method="lm",color="black")
g5<-g5+facet_wrap(~as.factor(startcityname),nrow=2)
g5<-g5+xlab("Precipitation in Inches")+ylab("Number of Daily rides")
g5<-g5+theme(legend.position="none", panel.grid.minor=element_blank())
g5
## `geom_smooth()` using formula 'y ~ x'
These charts suggest much the same as San Francisco, namely that a negative relationship exists, but that caution in interpretation is required.
The weather data provides information on wind speed. The chart below shows daily ridership as a function of windspeed for San Francisco.
#San Francisco
g1sf<-ggplot(sf_rain, aes(x=mean_wind_speed_mph,y=n)) #specify data and x and y
g1sf<-g1sf+geom_point()+ geom_smooth(method=lm)
g1sf<-g1sf + ylab("No. of daily trips")+xlab("Mean Wind Speed") +
ggtitle("Number of Bike Trips and Wind Speed in San Francisco")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))#add labels
g1sf #display result
## `geom_smooth()` using formula 'y ~ x'
This chart suggests that wind speed has no effect on daily ridership levels in San Francisco. The charts below consider the impact in the other cities in the Bay Area.
#show the other cities in the Bay Area, excl San Francisco
#wind speed
g5<-ggplot(filter(rainexsf, is.na(startcityname) == FALSE),
aes(x=mean_wind_speed_mph,y=n, color=as.factor(startcityname)))
g5<-g5+geom_point(alpha=1,size=2)
g5<-g5+geom_smooth(method="lm",color="black")
g5<-g5+facet_wrap(~as.factor(startcityname),nrow=2)
g5<-g5+xlab("Mean Wind Speed MPH")+ylab("Number of Daily rides")
g5<-g5+theme(legend.position="none", panel.grid.minor=element_blank())
g5
## `geom_smooth()` using formula 'y ~ x'
With the exception of Mountain View, these charts suggest that daily ridership is not affected by wind speed.
The weather data also has cloud cover information. Analysing this for San Francisco gives:
g1sf<-ggplot(sf_rain, aes(x=cloud_cover,y=n)) #specify data and x and y
g1sf<-g1sf+geom_point()+ geom_smooth(method=lm)
g1sf<-g1sf + ylab("No. of daily trips")+xlab("Cloud Cover") +
ggtitle("Number of Bike Trips and Cloud Cover in San Francisco")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))#add labels
g1sf #display result
## `geom_smooth()` using formula 'y ~ x'
This chart suggests that increased cloud cover may have a slight negative impact on ridership. The data also show that cloud cover occurs far more frequently than precipitation, suggesting it is a factor distinct from precipitation.
Running the analysis for the other cities in the Bay Area gives:
#show the other cities in the Bay Area, excl San Francisco
#cloud cover
g5<-ggplot(filter(rainexsf, is.na(startcityname) == FALSE),
aes(x=cloud_cover,y=n, color=as.factor(startcityname)))
g5<-g5+geom_point(alpha=1,size=2)
g5<-g5+geom_smooth(method="lm",color="black")
g5<-g5+facet_wrap(~as.factor(startcityname),nrow=2)
g5<-g5+xlab("Cloud Cover 1-8 scale")+ylab("Number of Daily rides")
g5<-g5+theme(legend.position="none", panel.grid.minor=element_blank())
g5
## `geom_smooth()` using formula 'y ~ x'
These charts suggest a slight negative impact for cloud cover on daily ridership levels.
Taking all of the above factors into consideration, we develop a model to assess whether daily ridership can be predicted from the various factors likely to be in effect on a given day.
The following regression model considers these factors: date (the progression of time through the pilot study); day (the day of the week); holiday (whether the day is a national holiday); cloud cover (the degree of cloud cover on the day); precipitation in inches (the amount of rainfall during the day); and mean temperature (the daily average temperature in Fahrenheit).
## Define the model's formula
mod1_formula<-formula(n ~ date + day + holiday + cloud_cover +
precipitation_inches + mean_temperature_f )
## Run the model against all of the data - use the stored formula
basic.mod<-lm(mod1_formula,
data=sf_tidy); summary(basic.mod)
##
## Call:
## lm(formula = mod1_formula, data = sf_tidy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -734.92 -62.00 13.30 92.86 550.22
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.246e+03 4.707e+02 -9.022 < 2e-16 ***
## date 2.857e-01 2.934e-02 9.738 < 2e-16 ***
## dayMonday 8.463e+01 2.258e+01 3.748 0.000194 ***
## daySaturday -5.792e+02 2.214e+01 -26.161 < 2e-16 ***
## daySunday -6.440e+02 2.208e+01 -29.163 < 2e-16 ***
## dayThursday 8.040e+01 2.244e+01 3.583 0.000365 ***
## dayTuesday 1.099e+02 2.248e+01 4.887 1.29e-06 ***
## dayWednesday 9.527e+01 2.217e+01 4.296 2.00e-05 ***
## holidayTRUE -5.208e+02 4.231e+01 -12.310 < 2e-16 ***
## cloud_cover -7.858e+00 2.900e+00 -2.710 0.006914 **
## precipitation_inches -3.629e+02 3.464e+01 -10.478 < 2e-16 ***
## mean_temperature_f 9.707e+00 9.836e-01 9.869 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 154.4 on 651 degrees of freedom
## Multiple R-squared: 0.8292, Adjusted R-squared: 0.8264
## F-statistic: 287.4 on 11 and 651 DF, p-value: < 2.2e-16
This regression shows an adjusted R-squared of 82.64%, with statistical significance for each of the factors at the 99% level.
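One detail worth noting in the summary above: the `day` variable enters the model as dummy variables (dayMonday, daySaturday, and so on), with the alphabetically first level, Friday, absorbed into the intercept as the baseline. This can be seen with `model.matrix()` on a toy factor (the three values below are arbitrary):

```r
# R dummy-codes a factor predictor, dropping the first level ("Friday",
# alphabetically) as the baseline captured by the intercept
d <- data.frame(day = c("Friday", "Monday", "Saturday"))
model.matrix(~ day, data = d)
#   (Intercept) dayMonday daySaturday      (attributes omitted)
# 1           1         0           0   <- Friday is the baseline
# 2           1         1           0
# 3           1         0           1
```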
The following code runs a cross validation of the model on the dataset using 10 folds.
#run the k-fold cross validation, using 10 folds
sf<-sf_tidy%>%
crossv_kfold(10)
sf
## # A tibble: 10 x 3
## train test .id
## <named list> <named list> <chr>
## 1 <resample> <resample> 01
## 2 <resample> <resample> 02
## 3 <resample> <resample> 03
## 4 <resample> <resample> 04
## 5 <resample> <resample> 05
## 6 <resample> <resample> 06
## 7 <resample> <resample> 07
## 8 <resample> <resample> 08
## 9 <resample> <resample> 09
## 10 <resample> <resample> 10
#calculate the rmse
rmse_mod1<-sf %>%
mutate(train = map(train, as_tibble))%>% #Convert to tibbles
mutate(model = map(train, ~ lm(mod1_formula,
data = .))) %>%
mutate(rmse = map2_dbl(model, test, rmse)) %>% #apply model, get RMSE
select(.id, rmse) ## pull just id and rmse
#show rmse results of model
summary(rmse_mod1$rmse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 128.4 144.0 152.6 155.4 164.0 202.1
#plot the rmse results
gg<-ggplot(rmse_mod1, aes(rmse))
gg<-gg + geom_density()+ ggtitle("K Fold cross validation")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
gg
These results show a median RMSE of c. 153, with a lower quartile of c. 144 and an upper quartile of c. 164. This suggests that the model's daily ridership predictions will typically be off by around 150 riders per day.
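The same k-fold idea can be sketched without modelr, using only base R on synthetic data (the linear toy data below is invented purely for illustration):

```r
set.seed(1)
# toy data: a linear trend plus noise, standing in for the ridership data
d <- data.frame(x = 1:100)
d$y <- 2 * d$x + rnorm(100, sd = 5)

k <- 10
fold <- sample(rep(1:k, length.out = nrow(d)))  # assign each row to a fold

# fit on k-1 folds, compute RMSE on the held-out fold, repeat for each fold
rmse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = d[fold != i, ])
  pred <- predict(fit, newdata = d[fold == i, ])
  sqrt(mean((d$y[fold == i] - pred)^2))
})
mean(rmse)   # close to the noise sd of 5
```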
Running a Monte Carlo cross validation 1,000 times, with 20% of the data held out for testing in each run, gives (note: fitting the 1,000 models should take a few seconds):
#run the monte carlo simulation cross validation
tic()
sf_cv<-sf_tidy%>%
crossv_mc(n=1000,test=.2)
sf_cv
## # A tibble: 1,000 x 3
## train test .id
## <list> <list> <chr>
## 1 <resample> <resample> 0001
## 2 <resample> <resample> 0002
## 3 <resample> <resample> 0003
## 4 <resample> <resample> 0004
## 5 <resample> <resample> 0005
## 6 <resample> <resample> 0006
## 7 <resample> <resample> 0007
## 8 <resample> <resample> 0008
## 9 <resample> <resample> 0009
## 10 <resample> <resample> 0010
## # … with 990 more rows
toc()
## 0.158 sec elapsed
#calculate rmse
mod1_rmse_cv<-sf_cv %>%
mutate(train = map(train, as_tibble)) %>% ## Convert to tibbles
mutate(model = map(train, ~ lm(mod1_formula, data = .)))%>%
mutate(rmse = map2_dbl(model, test, rmse))%>%
select(.id, rmse) ## pull just id and rmse
mod1_rmse_cv
## # A tibble: 1,000 x 2
## .id rmse
## <chr> <dbl>
## 1 0001 153.
## 2 0002 146.
## 3 0003 152.
## 4 0004 146.
## 5 0005 145.
## 6 0006 144.
## 7 0007 161.
## 8 0008 169.
## 9 0009 165.
## 10 0010 161.
## # … with 990 more rows
summary(mod1_rmse_cv$rmse)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 113.4 147.4 157.5 157.7 168.5 207.3
gg<-ggplot(mod1_rmse_cv, aes(rmse))
gg<-gg + geom_density()+ ggtitle("Monte Carlo cross validation")+
theme(axis.line=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12))
gg
These results are in alignment with the k-fold validation, i.e. indicating a typical error in the region of 150 riders per day.
To put this into context, it is worthwhile considering the average daily ridership in San Francisco during the period of this pilot study. This code also calculates the mean daily ridership in the Bay Area other than San Francisco.
#calculate the mean daily ridership in San Francisco
meansfrider1<-mean(sf_tidy$n)
meansfrider1
## [1] 818.1312
#calculate the mean daily ridership for Bay Area ex SF
meansfrider2<-mean(rainexsf$n)
meansfrider2
## [1] 22.89292
This section provides some additional context on the weather and trip data.
#produce a table of relative frequency of rain
wsf%>%
group_by(precipitation_inches)%>%
summarise(n=n())%>%
ungroup%>%
mutate(total =sum(n), rel.freq = n/total)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 45 x 4
## precipitation_inches n total rel.freq
## <dbl> <int> <int> <dbl>
## 1 0 586 733 0.799
## 2 0.01 16 733 0.0218
## 3 0.02 5 733 0.00682
## 4 0.03 2 733 0.00273
## 5 0.05 1 733 0.00136
## 6 0.06 1 733 0.00136
## 7 0.07 1 733 0.00136
## 8 0.08 2 733 0.00273
## 9 0.09 2 733 0.00273
## 10 0.1 2 733 0.00273
## # … with 35 more rows
The charts above indicated that it didn't rain particularly frequently in San Francisco. The table above shows the actual frequency over the period of the pilot study: almost 80% of the days had no precipitation at all.
The following table shows the most commonly used routes in the San Francisco study. The table shows that no one route dominates.
#produce a table of relative frequency of trips in the Bay Area
freqtriptable<-tripssf%>%
group_by(start_station_name, end_station_name)%>%
summarise(n=n())%>%
ungroup%>%
mutate(total =sum(n), rel.freq = n/total)
## `summarise()` regrouping output by 'start_station_name' (override with `.groups` argument)
arrange(freqtriptable, -rel.freq)
## # A tibble: 1,375 x 5
## start_station_name end_station_name n total rel.freq
## <chr> <chr> <int> <int> <dbl>
## 1 San Francisco Caltrain 2 (… Townsend at 7th 6216 603708 0.0103
## 2 Harry Bridges Plaza (Ferry… Embarcadero at Sansome 6164 603708 0.0102
## 3 Townsend at 7th San Francisco Caltrain (To… 5041 603708 0.00835
## 4 2nd at Townsend Harry Bridges Plaza (Ferry… 4839 603708 0.00802
## 5 Harry Bridges Plaza (Ferry… 2nd at Townsend 4357 603708 0.00722
## 6 Embarcadero at Sansome Steuart at Market 4269 603708 0.00707
## 7 Embarcadero at Folsom San Francisco Caltrain (To… 3967 603708 0.00657
## 8 Steuart at Market 2nd at Townsend 3903 603708 0.00647
## 9 2nd at South Park Market at Sansome 3627 603708 0.00601
## 10 San Francisco Caltrain (To… Harry Bridges Plaza (Ferry… 3622 603708 0.00600
## # … with 1,365 more rows
The table is a little difficult to digest, so the following chart was prepared. This shows the frequency of routes between different stations, mapped on the basis of their geographic co-ordinates.
#This section of code prepares a graph of the most frequent routes,
#plotted based on their co-ordinates. We need to identify the stations by
#their lat/long co-ordinates, which we do by joining two tables - the trips
#and the stations. First we isolate the trips and add the number of
#observations per day.
tripssf2<-tripssf%>%
group_by(date, start_station_id)%>%
mutate(number=n_distinct (date))%>%
mutate(number_of_obs=n())
stationssf <- filter(stations, city=="San Francisco") #isolate the stations
#just in san francisco.
stationssf$end_station_id<- stationssf$id # create a new column to match the
#end stations.
names(stationssf)[names(stationssf)=="id"]<-"start_station_id" #name the column
#to be consistent across data tables for the start stations.
stationssf<-stationssf[,c(1,2,3,4,8)] #select columns needed.
sflatlon<-right_join(tripssf2,stationssf, by=c("start_station_id")) #combine
#trips and station information.
names(sflatlon)[names(sflatlon)=="lat"]<-"startlat" #this renames the lat col
#as start lat so as to differentiate it later from end lat (which would
#otherwise be shown as lat also).
names(sflatlon)[names(sflatlon)=="long"]<-"startlong" #as above but for
#longitude.
sflatlon<-sflatlon[,c(5,8,17,18)] #this extracts the data needed from the table.
names(sflatlon)[names(sflatlon)=="end_station_id.x"]<-"end_station_id" #this
#renames the column so that we can combine by end_station_id
sflatlon<-sflatlon[,-c(1)] #this removes the start station id so that we
#can compile the end station id co-ordinates
sflatlon<-right_join(sflatlon,stationssf, by=c("end_station_id")) #this
#compiles the data
names(sflatlon)[names(sflatlon)=="lat"]<-"endlat" #this renames to identify
#that the latitude is for the end station
names(sflatlon)[names(sflatlon)=="long"]<-"endlong" #this renames to identify
#that the longitude is for the end station
sflatlon<-sflatlon[,c(2,3,6,7)] #this extracts the columns that we need -
#lat and long for both start and end stations
#this code calculates the unique number of trips between two points.
sflatlon2<-sflatlon%>%
group_by(startlat, startlong, endlat, endlong)%>%
summarise(n=n())
## `summarise()` regrouping output by 'startlat', 'startlong', 'endlat' (override with `.groups` argument)
#this plots the frequency of routes in SF
sfroutes <- ggplot() +
geom_segment(data=sflatlon2, #this draws the lines between the start and
#end points
aes(x=startlong, y=startlat, xend=endlong, yend=endlat),
col="dark grey",
size=sflatlon2$n/10000) + #this scales the size of the lines
geom_point(data=sflatlon2,
aes(x=startlong, y=startlat),
colour="blue") +
geom_point(data=sflatlon2,
aes(x=endlong, y=endlat),
colour="blue", alpha=0.15) +
geom_text_repel(data=stationssf, aes(x=long, y=lat, label=name),
colour="blue", size=2.5)+ #this adds the station names
#(ggrepel handles overlapping labels automatically).
coord_cartesian(ylim=c(37.77, 37.805), xlim=c(-122.42, -122.387)) +
theme(axis.line=element_blank(),
#axis.text.x=element_blank(),
#axis.text.y=element_blank(),
#axis.title.x=element_blank(),
#axis.title.y=element_blank(),
axis.ticks=element_blank(),
#panel.border=element_blank(),
panel.grid.major=element_blank(),
panel.grid.minor=element_blank(),
plot.title=element_text(hjust=0.5, size=12)) +
ggtitle("Frequency of routes in San Francisco bike study") +
ylab("Latitude")+xlab("Longitude")
sfroutes
This analysis demonstrates that a model can be built that predicts daily ridership with a reasonable degree of accuracy, given the day of the week, whether the day is a holiday, the temperature on the day, the level of precipitation, the cloud cover, and how long the cycle scheme has been running. Based on the available data, the model explains 82.6% of the variability in daily ridership. The root mean squared error is c. 150 riders per day, against an average daily ridership of 818 in the period under review.
Like all models, this one has some inherent limitations. Firstly, there was very little rainfall in the period covered by the data, so the model is likely to be less reliable in periods of rain. Secondly, ridership increased over the period, and the model assumes that this increase will continue; while this may be reasonable in the short term, growth is unlikely to continue at the same rate indefinitely. Thirdly, the model does not take into account a wide range of other factors, for example local activities such as sporting events, that may affect ridership on individual days.
Fourthly, ridership varies between stations. If the mix of stations changes (e.g. through the addition of new stations or the removal of existing ones), the model will be less reliable. Finally, no information was collected about what form of transportation the bike program replaced, so the model provides no indication of how much traffic or pollution the program reduced.
The study showed that the trial period had an average daily ridership of 818 and that ridership continued to grow over the period of study. We suggest that the study be expanded to incorporate more stations and more bikes to collect more data. We also suggest that additional information be collected so that the impact on traffic and pollution can be considered. This information could, for example, include research on what form of transportation the bike replaced.
Overall, this was a very interesting pilot and we were very pleased to have been able to study the data.